
    On Static Timing Analysis of GPU Kernels

    We study static timing analysis of programs running on GPU accelerators. Such programs follow a data-parallel programming model that allows massive parallelism on manycore processors. Data-parallel programming and GPUs as accelerators have come into wide use in recent years. Timing analysis of programs running on single-core machines is well understood and is applied in practice. For multicore and manycore machines, however, timing analysis remains a significant problem that has not yet been properly solved. In this paper, we present static timing analysis of GPU kernels based on a method that we call abstract CTA simulation. Cooperative Thread Arrays (CTAs) are the basic execution structure of GPU devices, which execute threads in groups called warps. Abstract CTA simulation is based on static analysis of thread divergence in warps and their abstract scheduling.

    Using static program analysis to compile fast cache simulators

    This thesis presents a generic approach to compiling fast execution-driven simulators and applies it to cache simulation of programs. The resulting cache simulation method reduces the time needed for cache performance evaluations without losing accuracy. Fast cache simulators are needed in the performance analysis of software systems. To properly understand the cache behavior caused by a program, simulations must be performed with a sufficient number of inputs. Traditional simulation of the memory operations of a program can be orders of magnitude slower than executing the program. This leads to simulation times that are often infeasible in software development. The approach of this thesis is based on using static cache analysis to guide partial evaluation and slicing of simulators. Because of redundancy in the memory access patterns of typical programs, an execution-driven cache simulator program can be partially evaluated during its compilation. Program slicing can be used to remove the computations that have no effect on the simulation result. The static cache analysis presented in this thesis is generic. The analysis is designed especially for programs that use dynamic addressing. The thesis assumes an address analysis that gives the cache analysis static information about cache aliases and cache conflicts between accessed memory lines. To determine the memory references that always cause cache hits or cache misses, the thesis describes both must and may analyses of cache states. The cache state analysis is built using abstract interpretation, which also provides the basis for proving the soundness of the analysis. The potential performance of the method was experimentally evaluated. The thesis describes both a tool set implementing the cache analysis method and experiments done with the tool set. The experiments indicate that a simple implementation is capable of significantly speeding up the simulations.
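As a rough illustration of what a must analysis of cache states can look like, the sketch below follows the textbook abstract-interpretation formulation for a single LRU cache set. It is not the analysis from the thesis, which additionally handles dynamic addressing; the names `must_update`, `must_join`, and `always_hit` are hypothetical. An abstract state maps each memory line to an upper bound on its LRU age, and a line present in the state is guaranteed to be cached.

```python
def must_update(state, line, ways):
    """Abstract effect of accessing `line` in one LRU set with `ways` entries.

    `state` maps a memory line to an upper bound on its LRU age.
    A line that is absent from `state` is not guaranteed to be cached.
    """
    old_age = state.get(line, ways)  # `ways` stands for "possibly not cached"
    new_state = {}
    for other, age in state.items():
        if other == line:
            continue
        # Only lines that were younger than the accessed line grow older.
        new_age = age + 1 if age < old_age else age
        if new_age < ways:  # lines aged past the associativity may be evicted
            new_state[other] = new_age
    new_state[line] = 0  # the accessed line becomes the youngest
    return new_state


def must_join(a, b):
    """Merge two control-flow paths: keep lines cached on both, with the larger age bound."""
    return {line: max(a[line], b[line]) for line in a.keys() & b.keys()}


def always_hit(state, line):
    """A reference is classified as an always-hit if the line is in the must state."""
    return line in state
```

A reference classified by `always_hit` needs no run-time simulation work, which is the kind of redundancy that partial evaluation of a simulator could exploit.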

    Transitive closure algorithm MEMTC and its performance analysis

    In this paper, we present a new algorithm for computing the full transitive closure, designed for operation in layered memories. The algorithm is based on strongly connected component detection and on a very compact representation of data. We analyze the average-case performance of the algorithm experimentally in an environment where two layers of memory of different speeds are used. In our analysis, we use trace-based simulation of memory operations.
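The general idea of computing a transitive closure through strongly connected components can be sketched as follows. This is a minimal illustration using Tarjan's algorithm, not the MEMTC algorithm itself, and it does not model layered memory; the function name `transitive_closure` and the dictionary-based graph format are assumptions made for the example. All nodes of one component share a single successor set, which hints at why component detection permits a compact representation.

```python
def transitive_closure(graph):
    """Map each node to the frozenset of nodes reachable from it by a nonempty path.

    `graph` maps a node to an iterable of its successors. The recursion is
    fine for small graphs; large graphs would need an iterative variant.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    comp_of = {}        # node -> component id
    comp_members = []   # component id -> list of nodes
    comp_reach = []     # component id -> set of reachable nodes
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] != index[v]:
            return
        # v is the root of a component: pop its members off the stack.
        cid = len(comp_members)
        members = []
        while True:
            w = stack.pop()
            on_stack.discard(w)
            comp_of[w] = cid
            members.append(w)
            if w == v:
                break
        comp_members.append(members)
        # Components are completed in reverse topological order, so every
        # successor component already has its final reachability set.
        reach = set()
        for m in members:
            for w in graph.get(m, ()):
                c = comp_of[w]
                if c == cid:
                    reach.update(members)  # an internal edge means a cycle
                else:
                    reach.update(comp_members[c])
                    reach |= comp_reach[c]
        comp_reach.append(reach)

    for v in list(graph):
        if v not in index:
            visit(v)
    shared = [frozenset(r) for r in comp_reach]  # one set per component
    return {v: shared[c] for v, c in comp_of.items()}
```

For the graph `{1: [2], 2: [3], 3: [2, 4], 4: []}`, nodes 2 and 3 form one component and share the same successor set.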

    Experience in Performance Analysis of Large Real-Time Systems

    In this paper, we discuss the experience we gained during three performance engineering projects carried out in co-operation with the telecommunication industry. In each of the projects, we analyzed the performance of a real-time system with large embedded software. Each system was at a different stage of development. We used three different modeling techniques: queueing networks, execution graphs, and message sequence charts. We also applied different approaches to building models. To give guidelines for analyzing large systems, we discuss our method of analysis, the rationale for the choices we made, and the lessons we learned. 1. Introduction In this paper, we discuss performance analysis of large real-time systems. The discussion is based on three case studies, each of which was a performance analysis of a telecommunication system. The performance analyses of the systems were done in close co-operation with the Finnish telecommunication industry. All the analyzed systems were embedded telecomm..

    DBE: A Tool for Trace Driven Memory Simulation

    DBE is an experimental tool designed for trace-driven simulation of processor caches and disk buffers. Trace-driven simulation is flexible and requires no special hardware, but generating traces can be too slow and the resulting traces too large to handle. To overcome this problem, DBE uses compile-time trace compaction, which can yield smaller traces and faster run times.
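For readers unfamiliar with trace-driven simulation, the following minimal sketch replays a recorded address trace against a model of a set-associative LRU cache and counts hits and misses. It is not DBE's implementation and does not include trace compaction; the function name `simulate_trace` and the default cache parameters are assumptions chosen for illustration.

```python
from collections import OrderedDict


def simulate_trace(trace, num_sets=4, ways=2, line_size=16):
    """Replay a list of byte addresses against an LRU cache model.

    Returns a (hits, misses) pair. Each set is an OrderedDict whose
    insertion order tracks recency: the first key is the least recently used.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        line = addr // line_size          # memory line holding this address
        cache_set = sets[line % num_sets] # set the line maps to
        if line in cache_set:
            cache_set.move_to_end(line)   # mark as most recently used
            hits += 1
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[line] = True
    return hits, misses


hits, misses = simulate_trace([0, 4, 64, 128, 0])
print(hits, misses)
```

Because the simulator touches every trace entry once, its cost grows with trace length, which is why shorter traces translate directly into faster simulation runs.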